Abstract

In this paper, we categorize fine-grained images without using any object / part annotation in either the training or the testing stage, a step towards making the approach suitable for deployment. Fine-grained image categorization aims to classify objects with subtle distinctions. Most existing works rely heavily on object / part detectors to build the correspondence between object parts, using object or object part annotations in the training images. The need for expensive object annotations prevents the wide usage of these methods. Instead, we propose to select useful parts from multi-scale part proposals in objects, and use them to compute a global image representation for categorization. This is specially designed for the annotation-free fine-grained categorization task, because useful parts have been shown to play an important role in existing annotation-dependent works, but accurate part detectors are hard to acquire. With the proposed image representation, we can further detect and visualize the key (most discriminative) parts in objects of different classes. In the experiments, the proposed annotation-free method achieves better accuracy than state-of-the-art annotation-free methods and most existing annotation-dependent methods on two challenging datasets, which shows that it is not always necessary to use accurate object / part annotations in fine-grained image categorization.

1. Introduction 

Fine-grained image categorization has been popular during the past few years. Different from traditional image recognition such as scene or object recognition, fine-grained categorization deals with images with subtle distinctions, which usually involves the classification of subclasses of objects belonging to the same class, like birds [?], dogs [?], planes [25], plants [?], etc. Therefore, it requires methods that are more discriminative than those used for traditional image classification.

One important common feature of existing fine-grained methods is that they explicitly use annotations of the object or even object parts to depict an object as precisely as possible. Most of them rely heavily on object / part detectors to find the part correspondence among objects.

For example, in [?], poselets [?] are used to detect object parts. Then, each object is represented with a bag of poselets, and suitable matchings among poselets (parts) can be found between two objects. Instead of using poselets, [34] used the deformable part model (DPM) [9] for object part detection. DPM is learned from the annotated object parts in training objects, and is then applied to testing objects to detect parts. Some works [?] transfer the part annotations from objects in training images to those sharing similar shapes in testing images instead of applying object / part detectors. Instead of seeking precise part localization, [?] also provided an unsupervised object alignment technique, which roughly aligns objects and divides them into corresponding parts along certain directions. It achieves better results than the label transfer method. Recently, [?] proposed to use object and part detectors with powerful CNN feature representations [?], which achieves state-of-the-art results on the Caltech-UCSD Birds (CUB) 200-2011 [26] dataset. The geometric relation between an object and its parts is considered in [?]. [?] also shows that part-based models with CNN features are able to capture subtle distinctions among objects. Some other works [?] recognize fine-grained images with a human in the loop.

In order to achieve accurate part detection, most existing works require annotated bounding boxes for objects in both the training and testing stages. As pointed out in [?], such a requirement is not realistic for practical usage. Thus, a few works [?] have looked into a more realistic setup, i.e., they utilize the bounding box only in the training stage but not in the testing stage. However, even with such a setup, wide deployment of these methods is still hard, since the accurate object annotations needed in the training stage are usually expensive to acquire, especially for large-scale image classification problems. It is an interesting research problem to free us from the dependency on detailed manual annotations in fine-grained image categorization tasks. [28] has shown promising results without using manual annotations. It tries to detect accurate objects and parts with complex deep learning models for fine-grained recognition.

In this paper, we aim at categorizing fine-grained images with only category labels and without any bounding box annotation in either the training or the testing stage, while not degrading the categorization accuracy. This is a big step towards making fine-grained image categorization suitable for wide deployment. In existing annotation-dependent works, representative parts like the head and body of birds [?] have been shown to play the key role in capturing the subtle differences of fine-grained images. This differs from general image recognition, which usually uses a holistic image representation. In this paper, we select the most important parts from multiple part proposals in each image in the annotation-free scenario. The part proposals are the sub-regions of object proposals in each image, as shown in Fig. 1. The part selection process is important in an annotation-free fine-grained image categorization system for at least two reasons. First, many (if not most) part proposals are noise and not useful for categorization. Second, accurate part detectors are hard to acquire without access to detailed object and part annotations, including groundtruth object and part locations. Existing accurate part detectors (e.g., [?]) are annotation-dependent, unlike our annotation-free setup.

We propose to select many useful parts from multi-scale part proposals of objects in each image and compute a global image representation for it, which is then used to learn a linear classifier for image categorization. In this image representation, we believe that selecting many useful parts is better than selecting one exact part, because it is very difficult to determine the exact object / part location in the image in our annotation-free scenario. Multiple useful parts can complement each other to provide more useful information in characterizing the object. The proposed representation achieves better accuracy than the annotation-free work [28] and even most existing annotation-dependent methods on two challenging benchmark datasets, the Caltech-UCSD Birds 200-2011 [26] and the StanfordDogs [?] datasets. Its success suggests that it is not always necessary to learn expensive object / part detectors in fine-grained image categorization.

Figure 1. System overview. This figure is best viewed in color.

Fig. 1 gives an overview of how we generate a global representation for each image through selected parts. The framework in Fig. 1 consists of three major steps: part proposal generation, useful part selection, and multi-scale image representation.

• In the first step, we extract object proposals, which are image patches that may contain an object. Then, multi-scale part proposals are extracted from the object proposals in each image. An efficient multi-max pooling (MMP) strategy is proposed to generate features for multi-scale part proposals by leveraging the internal structure of the CNN on object proposals. Among the large number of part proposals, most are from background clutter, which is harmful to categorization.

• Thus, in the second step, we select useful part proposals from each image by exploring useful information in part clusters (all part proposals are clustered). For each part cluster, we compute an importance score, indicating how important this cluster is for our fine-grained task. Those part proposals assigned to useful clusters (i.e., those with the largest importance scores) are selected as useful parts.

• Finally, the selected part proposals in each image are encoded into a global image representation. In order to highlight the subtle distinctions among fine-grained objects, we encode the selected parts on different scales separately, which we name SCale Pyramid Matching (ScPM). It is more discriminative than encoding all parts in one image altogether.

Figure 2. Black-capped Vireo and Yellow-throated Vireo. They have the most distinctive parts in multiple part proposals: black cap and yellow throat, respectively, which are specified in red boxes. On the right, we show the key parts detected using the proposed representation from the two species. More examples of detected discriminative parts can be found in Fig. [?]. This figure is best viewed in color.

With the proposed annotation-free fine-grained image representation, we can detect the key (most discriminative) parts in objects of different classes, and the results coincide well with rules used by human experts (e.g., the yellow-throated vireo and the black-capped vireo differ because the yellow-throated vireo has a yellow throat while the black-capped vireo has a black head, cf. Fig. 2).

2. Fine-grained image representation without 
using object / part annotations 

The three modules in the proposed method (part proposal generation, part selection, and multi-scale image representation) are detailed in Sections 2.1 to 2.3, respectively.

2.1. Part proposal generation 

Regional information has been shown to improve image classification with hand-crafted methods like spatial pyramid matching [?] and receptive fields [?]. When a CNN model is applied to an image, features of local regions can be acquired automatically from its internal structure. Assume the output of a layer in the CNN has N × N × d dimensions, i.e., it is the output of d filters for N × N spatial cells. Each spatial cell is computed from a receptive field in the input image. The receptive fields of the spatial cells can overlap heavily with each other in the input image. The size of one receptive field can be computed layer by layer in the CNN. In a convolution (pooling) layer, if the filter (pooling) size is a × a and the step size is s, then T × T cells in the output of this layer correspond to [s(T − 1) + a] × [s(T − 1) + a] cells in the input of this layer. For example, one cell in the CONV5 (the 5th convolutional) layer of the CNN model (imagenet-vgg-m) [4] corresponds to a 139 × 139 receptive field (assumed to reside completely inside the image) in the 224 × 224 input image (cf. Fig. 3).
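The layer-by-layer rule above can be checked with a few lines of Python. This is a minimal sketch: the function name and the layer list are illustrative, and the (filter size, stride) pairs are an assumed layout for imagenet-vgg-m up to CONV5, not values stated in the text. With these values the rule yields the 139 × 139 field quoted above.

```python
def receptive_field(layers, cells=1):
    """Side length (in input cells) covered by `cells` x `cells` output cells.

    `layers` lists (filter_size, stride) pairs ordered from input to output.
    """
    size = cells
    # Walk from the output layer back to the input, applying s * (T - 1) + a.
    for filter_size, stride in reversed(layers):
        size = stride * (size - 1) + filter_size
    return size


# Assumed layout of imagenet-vgg-m up to CONV5: (filter size, stride).
VGG_M_TO_CONV5 = [
    (7, 2),  # conv1
    (3, 2),  # pool1
    (5, 2),  # conv2
    (3, 2),  # pool2
    (3, 1),  # conv3
    (3, 1),  # conv4
    (3, 1),  # conv5
]

print(receptive_field(VGG_M_TO_CONV5))  # 139
```

Passing `cells=M` gives the receptive field of an M × M block of cells, which is the quantity that matters once adjacent cells are pooled together.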

The spatial cells in one CNN layer correspond to receptive fields of a fixed size, which are not comprehensive enough to characterize objects of different scales in images. Some efforts have been made to solve this problem. [?] applied multiple convolutional filters of different sizes in the CONV5 layer of the CNN to generate multi-scale part mappings for object detection. [30] applied a CNN model to images resized to different scales and pooled the final outputs using the Fisher vector method. These methods, however, are more time consuming than the original CNN computations.

We generate features of multi-scale receptive fields for an image by leveraging the internal outputs of the CNN with little additional computational cost (cf. Fig. 4). Considering the outputs of one layer in a CNN, we can pool the activation vectors of adjacent cells over windows of different sizes, which correspond to receptive fields of different sizes in the input image. Max-pooling is used here.

Given the N × N cells in one layer of the CNN, we use max-pooling to combine information from all M × M adjacent cells, where M ranges from 1 (a single cell) to N (all the cells). When M takes different values, the corresponding cells cover receptive fields of different sizes in the input image, thus providing more comprehensive information. We name the proposed part proposal generation strategy multi-max pooling (MMP) and apply it to the CONV5 layer, because the CONV5 layer can capture more meaningful object / part information than the shallower layers in the CNN [?]. When a CNN model is applied to an object bounding box in an image, the receptive fields acquired from MMP can be seen as the part candidates for an object, which provide a comprehensive description of fine-grained objects.
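A minimal pure-Python sketch of the pooling described above may help. It assumes that the M × M windows slide with stride 1 over the feature map (the stride is not stated in the text), and the function name and output layout are illustrative only.

```python
def multi_max_pooling(feature_map):
    """Max-pool every M x M window (stride 1) of an N x N x d feature map.

    Returns a dict mapping the scale M to a list of (row, col, vector)
    tuples, one per window position.
    """
    n = len(feature_map)
    d = len(feature_map[0][0])
    proposals = {}
    for m in range(1, n + 1):
        windows = []
        for r in range(n - m + 1):
            for c in range(n - m + 1):
                # Channel-wise maximum over the m x m block at (r, c).
                pooled = [
                    max(feature_map[r + i][c + j][k]
                        for i in range(m) for j in range(m))
                    for k in range(d)
                ]
                windows.append((r, c, pooled))
        proposals[m] = windows
    return proposals


# Toy 3 x 3 x 2 feature map standing in for a CONV5 output.
fmap = [[[float(3 * r + c), float(-(3 * r + c))] for c in range(3)]
        for r in range(3)]
parts = multi_max_pooling(fmap)
print({m: len(w) for m, w in parts.items()})  # {1: 9, 2: 4, 3: 1}
```

Under the same stride-1 assumption, an N = 13 feature map would give 13² + 12² + ... + 1² = 819 part proposals per object proposal.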

Part proposals are important for fine-grained image categorization because they provide fine-grained information about objects. Object proposals generated by objectness methods like selective search [24] are not fine-grained enough to characterize the internal structure of fine-grained objects. [?] only used such object proposals to detect parts with the help of object / part annotations, which leads to worse performance than our annotation-free method using part proposals (cf. Table 2).
Figure 3. Receptive fields computed using the CNN model (imagenet-vgg-m) [4]. One cell in the CONV5 layer corresponds to a 139 × 139 receptive field in the input image. We only show the spatial size of the image and filters; a × a is the filter (pooling) size, and 'st' means the step size.

Figure 4. Generating multi-scale part proposals. The input is an object proposal (the CNN input). By applying the CNN to it, spatial cells of different sizes on the CONV5 layer of the CNN correspond to parts of different scales. This figure is best viewed in color.

MMP is an efficient way to generate multi-scale part proposals to characterize fine-grained objects. It can be easily applied to millions or even billions of object proposals in a dataset. Unlike [?], where the outputs of CONV4 in the CNN are used as parts, MMP provides dense coverage on different scales, from the part level to the object level, for each object proposal. The large number of part proposals gives us more opportunity to mine subtle useful information about objects.

2.2. Part selection 

We propose to select useful parts from global image representations. To get the image representation, we first generate object proposals from each image. Since no object / part annotations are provided, we can only use unsupervised object detection methods. Considering efficiency, selective search [24] is used in our framework; it has also been used in [?] to generate initial object / part candidates for object detectors. After generating multiple object proposals, we apply the CNN model to each detected bounding box / object proposal, and use the proposed MMP to get part proposals.

Most of the object proposals are from background clutter, which is harmful for image recognition. For example, in the CUB200-2011 [26] dataset, only 10.4% of the object proposals cover the foreground object under the intersection over union criterion. The part proposals from the unsuccessful object proposals will contribute little to the classification, or even be noisy and harmful. Thus, for our image representation we need to select the useful part proposals (those covering the foreground object) without using groundtruth annotations.
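For readers unfamiliar with the intersection over union (IoU) criterion, the sketch below shows how such a coverage statistic could be computed. The 0.5 threshold, the box format, and the function names are assumptions for illustration; the text does not state them. The groundtruth box is used here only to measure proposal quality, not by the proposed method itself.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def covering_fraction(proposals, ground_truth, threshold=0.5):
    """Fraction of proposals whose IoU with the groundtruth box reaches
    the (assumed) threshold."""
    hits = sum(1 for p in proposals if iou(p, ground_truth) >= threshold)
    return hits / len(proposals)


gt = (0, 0, 10, 10)
proposals = [gt, (5, 5, 15, 15), (20, 20, 30, 30), (0, 0, 10, 8)]
print(covering_fraction(proposals, gt))  # 0.5
```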

We select useful parts by mining the useful information in part clusters. We first cluster all part proposals into several groups. Then, we compute the importance of each cluster for image classification. Those part proposals assigned to the useful clusters (the clusters with the largest importance values) are selected as the useful parts.

We compute the cluster importance with the aid of the Fisher vector (FV) [22].* We first encode all the part proposals in each image into an FV. Then, for each dimension of the FVs of all training images, we compute its importance as its mutual information (MI) with the class labels [?]. Finally, the cluster importance is the sum of the MI values of all FV dimensions in that cluster. We only keep the FV dimensions from the most important clusters for image categorization.
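The selection rule above can be sketched as follows. Two details are assumptions made only for this illustration: each FV dimension is binarized by its sign before the MI is computed (the text does not say how values are discretized), and the FV is laid out cluster by cluster with a fixed number of dimensions per cluster. All names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(values, labels):
    """MI (in nats) between a discrete feature and the class labels."""
    n = len(values)
    joint = Counter(zip(values, labels))
    p_val, p_lab = Counter(values), Counter(labels)
    mi = 0.0
    for (v, l), count in joint.items():
        # p(v, l) * log( p(v, l) / (p(v) * p(l)) )
        mi += (count / n) * math.log(count * n / (p_val[v] * p_lab[l]))
    return mi


def select_clusters(fvs, labels, dims_per_cluster, fraction):
    """Return (kept cluster indices, per-cluster importance)."""
    num_dims = len(fvs[0])
    num_clusters = num_dims // dims_per_cluster
    importance = [0.0] * num_clusters
    for dim in range(num_dims):
        # Assumption: binarize each FV dimension by its sign.
        bits = [fv[dim] > 0 for fv in fvs]
        importance[dim // dims_per_cluster] += mutual_information(bits, labels)
    keep = max(1, round(fraction * num_clusters))
    ranked = sorted(range(num_clusters),
                    key=lambda k: importance[k], reverse=True)
    return sorted(ranked[:keep]), importance


# Toy data: 4 images, 2 clusters with 2 FV dimensions each.
fvs = [[1.0, 0.5, 0.3, -0.2],
       [0.8, 0.7, -0.4, 0.1],
       [-0.9, -0.6, 0.2, -0.3],
       [-1.1, -0.4, -0.1, 0.4]]
labels = [0, 0, 1, 1]
selected, importance = select_clusters(fvs, labels, dims_per_cluster=2,
                                       fraction=0.5)
print(selected)  # [0]
```

In the toy data the first cluster's dimensions separate the two classes perfectly while the second cluster's dimensions are independent of the labels, so only the first cluster is kept.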

To the best of our knowledge, explicit part proposal generation and selection is proposed here for the first time for fine-grained image categorization in an annotation-free setup. As will be shown in Sec. 4.2, this novel strategy greatly improves categorization accuracy, even when object or part annotations are not used at all. Part selection can automatically explore the parts that are important for categorization by using image labels. It is more efficient and practical than trying to learn explicit part detectors without groundtruth object / part annotations. [28] also worked on fine-grained categorization without object / part annotations, but at a much higher computational cost than ours. [28] used two CNN models to detect interesting objects and further learn accurate part detectors from them. In contrast, we only need to select important parts from all part proposals, which are generated by applying one CNN model to object proposals. More importantly, our method shows that without explicitly detecting the fine-grained objects / parts, the proposed image representation can be more discriminative than [28] (cf. Sec. 4).

*VLAD can also be used in our framework; it is used in [?] to encode CNN features of multiple spatial regions for general image classification. We choose FV because it is more discriminative than VLAD [?].

2.3. Multi-scale image representation 

We select important part clusters for parts on different scales separately. Aggregating part proposals from different scales altogether into a single image representation cannot highlight the subtle distinctions in fine-grained images. Thus, we propose to encode the part proposals in an image on different scales separately, and we name this SCale Pyramid Matching (ScPM). For part proposals on different scales, we compute separate FVs. In practice, the number of scales can be very large (N = 13 in the CNN model in our paper), which may lead to a severe memory problem. Since the part proposals on neighboring scales are similar in size, we can divide all the scales into m (m < N) groups $g(j) \subseteq \{1, \ldots, N\}$, $j = 1, \ldots, m$. For an image $I$, the part proposals belonging to the scale group $g(j)$ are used to compute one FV $\phi_j(I)$, where $\{w_i^j, \mu_i^j, \sigma_i^j\}$ are respectively the mixture weights, mean vectors, and standard deviation vectors of the i-th selected diagonal Gaussian in the j-th scale group $g(j)$, $j = 1, \ldots, m$; $\{x_t\}$ are the selected part proposals in an image, $c(t)$ is the scale index of the t-th part, and $\gamma_t(i)$ is the weight of the t-th instance to the i-th Gaussian in the j-th scale group. Following [22], two parts corresponding to the mean and the standard deviation in each Gaussian of the FV are used. Each of the m FVs is power and $\ell_2$ normalized independently, and then concatenated to represent the whole image as


$$\Phi(I) = [\phi_1(I), \ldots, \phi_m(I)]. \qquad (4)$$
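For reference, with the symbols defined above, the standard improved Fisher vector of [22] restricted to one scale group can be written as in the sketch below. This is not a verbatim copy of the paper's own per-group equations: the indicator $\mathbb{1}[\cdot]$ that restricts the sum to parts whose scale falls in $g(j)$ and the normalization constants are assumptions taken from the usual formulation. Divisions and squares are element-wise.

```latex
\begin{align*}
f_{\mu_i^j}(I) &= \frac{1}{\sqrt{w_i^j}} \sum_{t} \mathbb{1}[c(t) \in g(j)]\,
    \gamma_t(i)\, \frac{x_t - \mu_i^j}{\sigma_i^j}, \\
f_{\sigma_i^j}(I) &= \frac{1}{\sqrt{2 w_i^j}} \sum_{t} \mathbb{1}[c(t) \in g(j)]\,
    \gamma_t(i) \left[ \frac{(x_t - \mu_i^j)^2}{(\sigma_i^j)^2} - 1 \right], \\
\phi_j(I) &= \left[ f_{\mu_1^j}(I),\, f_{\sigma_1^j}(I),\, f_{\mu_2^j}(I),\, f_{\sigma_2^j}(I),\, \ldots \right].
\end{align*}
```

Here $i$ ranges only over the Gaussians (part clusters) selected in scale group $g(j)$, so the length of $\phi_j(I)$ can differ between groups.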


ScPM is different from the Multi-scale Pyramid Pooling (MPP) method in [30]. On the one hand, MPP encodes local features from images resized to different scales into separate FVs, and aggregates all the FVs into one to represent an image. Such aggregation may not highlight the subtle differences of object parts on different scales, which is especially important for fine-grained objects within complex backgrounds. On the other hand, ScPM automatically selects a different number of important part clusters on different scales. The FV representations from the different scales therefore do not have the same length and cannot be aggregated as in MPP.

The source code will be published.
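The per-group normalization and concatenation used by ScPM can be sketched as below. The exponent 0.5 for the power normalization is an assumption (the text only says "power" normalized), and the function names are illustrative.

```python
import math


def power_l2_normalize(vec, alpha=0.5):
    """Signed power normalization followed by L2 normalization."""
    powered = [math.copysign(abs(v) ** alpha, v) for v in vec]
    norm = math.sqrt(sum(v * v for v in powered))
    return [v / norm for v in powered] if norm > 0 else powered


def scpm_concat(group_fvs):
    """Normalize each scale group's FV independently, then concatenate."""
    representation = []
    for fv in group_fvs:
        representation.extend(power_l2_normalize(fv))
    return representation


# Two scale groups whose FVs have different lengths after part selection.
rep = scpm_concat([[4.0, -9.0], [0.0, 16.0, 0.0]])
print(len(rep))  # 5
```

Because each group is normalized on its own before concatenation, groups with different numbers of selected clusters contribute vectors of different lengths but equal norm.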

3. Understand subtle visual differences: with the help of key part detection

We want to detect and show the key (most discriminative) parts in fine-grained images of different classes to gain more insight into the critical properties of objects, which may help us in feature design for fine-grained images. Note that we only have image labels for the training images. In order to find the key parts of a class, we need to propagate the training image labels to the parts in objects. Label propagation is also used in [?] on their feature representation to compute the object confidence map in general image recognition.

Suppose we want to interpret how the yellow-throated vireo differs from the black-capped vireo (illustrated in Fig. 2). We consider a binary classification problem where the yellow-throated vireo and the black-capped vireo are the positive and negative classes, respectively. We compute a score for each part to denote its importance in this binary classification. A part with the largest score is essential for the yellow-throated vireo, and a part with the smallest score (i.e., the most negative score) is key to the black-capped vireo.

We learn a max-margin binary classifier in each selected part cluster to compute the part score. This classifier is used to propagate the image labels to parts. In the training phase, for each selected part cluster, we aggregate the features of all parts in one image that are assigned to this cluster (similar to VLAD). The aggregated features of the training images are ℓ2 normalized and then used to train a classifier with the image labels. In the testing phase, the score of a given part is computed as the dot product between its feature (the CNN activation vector) and the classifier of the part cluster it falls in (only parts in the selected part clusters are considered). In both the training and testing processes, the part features are centered (i.e., the cluster center is subtracted in each part cluster).
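The two operations described above, aggregation for training and scoring for testing, can be sketched as follows. The classifier weights are treated as given: training the max-margin classifier on the aggregated vectors is left to a linear SVM solver and is not shown. All names and the toy numbers are hypothetical.

```python
import math


def aggregate(parts, center):
    """Sum the centered features of one image's parts in a cluster
    (VLAD-like), then L2-normalize the result."""
    agg = [0.0] * len(center)
    for feat in parts:
        for k, (x, c) in enumerate(zip(feat, center)):
            agg[k] += x - c
    norm = math.sqrt(sum(v * v for v in agg))
    return [v / norm for v in agg] if norm > 0 else agg


def part_score(feat, center, weights):
    """Dot product between the cluster's classifier and the centered part."""
    return sum(w * (x - c) for w, x, c in zip(weights, feat, center))


center = [1.0, 1.0]    # cluster center (toy values)
weights = [2.0, -1.0]  # hypothetical learned linear classifier
print(part_score([3.0, 1.0], center, weights))  # 4.0
print(part_score([1.0, 4.0], center, weights))  # -3.0
```

In this toy case the first part would be a key part for the positive class and the second for the negative class.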

4. Experiments 

In this section, we evaluate the proposed annotation-free method on fine-grained categorization. The selective search method [24] with default parameters is used to generate object proposals for each image. The pre-learned CNN models [?] from ImageNet are used to extract features from each object proposal as in [?], which has been shown to achieve state-of-the-art results. The CNN is fine-tuned with the training images and their labels. We do not fine-tune the CNN on object proposals, because many of them are from background clutter, which may deteriorate the CNN performance. We use the 'imagenet-vgg-m' model [4] on object proposals, given that its efficiency and accuracy are both satisfactory.

Table 1. Evaluation of different modules in the proposed image representation.

Method                  Accuracy (%)
CONV5 + MMP + ScPM      71.04
CONV5 + MMP             68.78
CONV5                   58.15

The part proposals in each scale group are assigned to 128 clusters. Each part feature is reduced to 128 dimensions by PCA. All 13 part scales (N = 13 in the CNN model) are divided into 8 scale groups: the first 4 scales form the first 4 groups, the subsequent 6 scales form 3 groups with 2 scales per group, and the last 3 scales form the last scale group. This arrangement makes the number of parts in each group roughly balanced. The dimension of the final image representation using FV is 128 × 2 × 128 × 8 = 262144, from which different fractions of useful part clusters will be selected and evaluated.
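The grouping and the dimension count above can be written out explicitly. This is a small sketch with illustrative function names; scales are numbered 1 to 13 here as an assumption.

```python
def scale_groups():
    """13 scales -> 8 groups: 4 singletons, 3 pairs, 1 triple."""
    groups = [[s] for s in range(1, 5)]              # scales 1 to 4
    groups += [[s, s + 1] for s in range(5, 11, 2)]  # (5,6), (7,8), (9,10)
    groups.append([11, 12, 13])                      # last three scales
    return groups


def fv_dimension(feature_dim, num_clusters, num_groups):
    # Mean and standard-deviation parts per Gaussian give the factor 2.
    return feature_dim * 2 * num_clusters * num_groups


groups = scale_groups()
print(len(groups))                          # 8
print(fv_dimension(128, 128, len(groups)))  # 262144
```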

We evaluate the proposed method on two benchmark 
fine-grained datasets: 

• CUB200-2011 [26]: The Caltech-UCSD Birds 200-2011 dataset contains 200 different bird classes. It includes 5994 training images and 5794 testing images in total.

• StanfordDogs [?]: This dataset contains 120 different types of dogs and includes 20,580 images in total.

For both datasets, we only use the class labels of images in the training stage.

We choose LIBLINEAR [?] to learn a linear SVM classifier for classification. All the experiments are run on a computer with an Intel i7-3930K CPU and 64 GB of main memory.

4.1. Influences of different modules 

We investigate the effect of the different modules in the proposed image representation on the CUB 200-2011 dataset in Table 1.

First, we consider the effect of MMP in the proposed image representation. We compare the part proposals generated using the outputs of CONV5 and CONV5 + MMP. All part proposals in each image are encoded into one FV (not using ScPM). It can be seen that multi-scale part proposals (CONV5 + MMP) greatly improve the recognition accuracy over single-scale part proposals (CONV5), by 10.63%. This is because MMP can provide very dense coverage of object parts on different scales.

Second, we evaluate the influence of ScPM in the proposed image representation. Using the multi-scale part proposals generated by MMP, ScPM achieves better accuracy (2.26% higher) than the method encoding all part proposals altogether. This shows that it is beneficial to encode objects at different scales separately.


Table 2. Classification accuracy on Caltech-UCSD Birds 200-2011.

Without annotations in either training or testing
Method                                      Selection fraction   Acc. (%)
Proposed                                    100% (All)           71.04
Proposed                                    75.0% (3/4)          71.67
Proposed                                    50.0% (1/2)          73.34
Proposed                                    25.0% (1/4)          75.02
Proposed                                    12.5% (1/8)          73.82
Two-level attention [28]                    -                    69.70

Use annotations in training, not in testing
Method                                                           Acc. (%)
DPD+DeCAF [?]                                                    44.94
Part based R-CNN (without parts) [32]                            52.38
Part based R-CNN-ft (without parts) [32]                         62.75
CF-45C (without parts) [?]                                       73.50
Part based R-CNN-ft (with parts) [32]                            73.89
Pose Normalized CNN [?]                                          75.70


Up to now, MMP + ScPM has shown better accuracy than the state-of-the-art annotation-free fine-grained categorization method [28] by 1.34%. Next, we further improve the accuracy with part selection on this representation.

4.2. Part selection 

We show the classification accuracy using part selection on the proposed image representation (MMP + ScPM) for CUB 200-2011 in Table 2.

Part selection can greatly improve the accuracy. The accuracy is shown for different fractions of part clusters selected in the image representation. When the most important quarter of the part clusters (fraction 25%) is used, a peak is reached, which is better than the accuracy without part selection (fraction 100%) by 3.98%. Even when fewer part proposals are selected (fraction 12.5%), the accuracy is still better than that without part selection, by 2.78%. This shows that part selection can effectively resist the noise introduced by part proposals from background clutter.

Our best accuracy (75.02%) outperforms the state-of-the-art annotation-free method [28] by 5.32%, and is also better than most existing annotation-dependent works. [28] claims that better results can be acquired if a more powerful CNN is used. For a fair comparison, we cite their results using the standard CNN structure (containing 5 convolutional layers). We only show the accuracy of annotation-dependent methods that use object / part annotations only in the training stage, which use the fewest annotations and are closest to our annotation-free setup. Most of these methods try to learn expensive part detectors to get accurate matching for recognition. However, our method shows that they are not always necessary, especially in annotation-free fine-grained categorization.

Part selection is more important in fine-grained categorization than (feature) selection in general image categorization. With part selection, the accuracy is 3.98% higher than with the original image representation. In [36], feature selection is used to compress FV for general image recognition such as object recognition. A much smaller improvement (around 1%) over the original FV is achieved after selection there (and the result is worse in most cases), which is much less than the improvement in Table 2. This fact clearly shows the distinction between these two applications. In annotation-free fine-grained tasks, selecting proper object parts is critical, while in general image recognition the global image representation without selection is already good.

Table 3. Classification accuracy on StanfordDogs.

Without annotations in either training or testing
Method                       Selection fraction   Acc. (%)
Proposed                     100% (All)           77.23
Proposed                     75.0% (3/4)          78.28
Proposed                     50.0% (1/2)          79.36
Proposed                     25.0% (1/4)          79.92
Proposed                     12.5% (1/8)          78.18
Two-level attention [28]     -                    71.90

Use annotations in both training and testing
Method                                            Acc. (%)
Edge templates [29]                               38.00
Unsupervised alignments [?]                       50.10
MTL [?]                                           39.30

We show the categorization accuracy for Stanford Dogs in Table 3. The proposed method (either with or without part selection) shows much better accuracy than existing annotation-dependent works. Part selection again plays an important role in the proposed image representation, leading to a 2.69% improvement over the original representation. Stanford Dogs is a subset of ImageNet. It is also evaluated in [28], which gets a worse result than ours.

Overall, these results show that: 1) part selection is im¬ 
portant in annotation-free fine-grained categorization; 2) it 
is not always necessary to learn expensive object / part de¬ 
tectors in fine-grained categorization. 

4.3. Key part visualization 

We detect and visualize the key parts for pairs of classes using the proposed image / part representation in Fig. [?]. In each pair, we show one sample image and the 20 detected key parts with the highest (smallest) scores for the positive (negative) class. The bird names are given in the captions, which also describe how humans distinguish the two birds.

The detected parts capture the key parts of these species, which coincide well with the human-defined rules. We also find that the proposed method can capture some tiny distinctions that cannot easily be discriminated by eye. For example, in the first pair, the key parts of the red-bellied woodpecker and the red-headed woodpecker are both red, and their locations are very close. From the detected parts, we can see that the red color of the red-headed woodpecker is darker and the feathers of the red-bellied woodpecker are finer.

From the detected parts, we can see the necessity of selecting many useful parts in the proposed image representation. Using only one (best) part may lose useful information for characterizing an object. Multiple (good) parts can complement each other in different aspects such as location, view, and scale. This also explains why the proposed representation works better than [28], which only uses the detected best part for categorization.

5. Conclusions 

In this paper, we propose to categorize fine-grained images without using any object / part annotation in either the training or the testing stage. Our basic idea is to select multiple useful parts from multi-scale part proposals and use them to compute a global image representation for categorization. This is specially designed for fine-grained categorization in the annotation-free scenario, because parts have been shown to play an important role in existing annotation-dependent works, while accurate part detectors are hard to acquire. In particular, we propose an efficient multi-max pooling strategy to generate multi-scale part proposals by using the internal outputs of a CNN on object proposals in each image. Then, we select useful parts from those part clusters which are important for categorization. Finally, we encode the selected parts on different scales separately in a global image representation. We also use the proposed image / part representation to detect the key parts in objects of different classes; the visualization results are intuitive and coincide well with rules used by human experts.

In the experiments, on two challenging datasets (the CUB 200-2011 and the StanfordDogs datasets), our proposed annotation-free method achieves classification accuracies of 75.02% and 79.92%, respectively, which are better than the results of state-of-the-art annotation-free work and most existing annotation-dependent methods. Future work includes utilizing the object information mined from the global image representation to help localize objects and further improve classification.
learned from training images. Top 20 key parts are shown for each class. The important parts found by the proposed method coincide well with the rules human experts use to distinguish these birds. This figure is best viewed in color.